Lack of data governance and stewardship is a major concern for enterprise and solution architects. Data mesh is an emerging data architecture that includes a radically different approach to data governance. Will it resolve the concerns?
I attended the 2022 Datanova virtual event hoping to get an answer to that question. Here's what I found.
Data Mesh
Data Mesh is a decentralized data architecture in which responsibility for data is given to people with expertise in its subject matter, and they deliver data as a product. Centralized data estates are thus replaced by meshes of independently governed data products. The idea was introduced by Zhamak Dehghani of Thoughtworks in 2019. (If you are new to it, read her article How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, or her follow-up Data Mesh Principles and Logical Architecture, published on Martin Fowler's site.)
Datanova was a Starburst event with sponsorship from a number of other leading data and technology companies. Starburst provides data analytics platforms based on Trino (formerly PrestoSQL), a fast, open-source, distributed SQL query engine. It can be used in a data mesh to derive data products and deliver operational data from them. To their credit, Datanova was not just a Starburst product pitch but included neutral presentations and discussions from a variety of industry experts.
To declare my interest, I am a fan of data mesh. My company Lacibus has implemented a data platform that, like Starburst, can be used in a data mesh, but based on a different technical approach. This article is not about the relative merits of different technical approaches, but about data governance, where I hope there is common ground.
Data Products
Several speakers brought out the advantages of data products and decentralized governance.
Zhamak Dehghani explained that data mesh enables small groups of people to focus on particular domains, creating data products that consumers can access without going through centralized governance. The data product is the unit of exchange for value, and this gives people the right focus. A centralized group is too far from the users to understand what they need. Domain teams are better placed because they can work directly with the users.
Small business design and marketing partner Vista adopted data mesh 18 months ago, and now has more than a hundred data products including self-service dashboards. Sebastian Klapdor, its EVP and CDO, told its success story, and stressed the benefit that product teams can really focus on their customers. Richard Jarvis, CTO at EMIS Health, and Mahesh Lagishetty, VP of Data Engineering at global payments company TSYS, with his Director/Lead Data Architect colleague Hariharan Banukumar, also gave positive customer presentations.
Daniel Abadi, professor of Computer Science at the University of Maryland, agreed with Zhamak that we need to get rid of central control. He distinguished data mesh from data fabric, which tackles the same problem differently: by automating what the central group does and reducing its size. Data mesh, he said, doesn't try to get rid of people, but to "use them smarter".
Max Schultze, data engineering manager at online fashion store Zalando, and Arif Wider, software engineering professor at HTW Berlin and technology consultant with Thoughtworks, shared the conclusions from their book Data Mesh in Practice. Centralized data ownership does not scale well organizationally, not so much because of the number of data sources and use cases as because of their complexity. Data mesh solves the problem of scale in getting value in the face of that complexity.
Domain Products and Derived Products
So we scrap our central data team, introduce decentralized data product teams, and get scalable, customer-focused data. Wonderful! Can we open the Champagne now?
Unfortunately, it's not that simple.
Data products do not all correspond to business domains. Those that do - marketing data products, sales data products, production data products, and so on - are the starting point. There are other data products that are derived from them. A business dashboard, for example, might include marketing, sales, and production data. And there can be many stages of derivation. A dashboard for business executives could be at the end of a long data pipeline with many domain feeds. An enterprise will have many such pipelines, woven together to form its data mesh.
In her data product master class, Teresa Tung, Accenture’s Cloud First Chief Technologist, explained that derived products are data products formed by application logic and analytics that may change format and add propositions, observations, and feeds from core services. A data product is typically created by a team that includes architects and data specialists, whose focus is on data value; the business sponsor of the product values the data as an asset. A derived data product can still add value and can have a business sponsor of its own.
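To make derivation concrete, here is a minimal sketch in Python, with invented data and field names, of a derived product built from two domain products: it changes the format of the inputs and adds an observation (a conversion rate) that neither source carries on its own.

```python
# Hypothetical sketch: a derived data product built from two domain data products.
# It changes format (per-campaign rows become one dashboard record per region)
# and adds an observation that neither source carries on its own (conversion rate).

# Domain product 1: marketing (campaign-level leads per region).
marketing = [
    {"region": "EMEA", "campaign": "spring", "leads": 1200},
    {"region": "EMEA", "campaign": "summer", "leads": 800},
    {"region": "AMER", "campaign": "spring", "leads": 2000},
]

# Domain product 2: sales (orders won per region).
sales = [
    {"region": "EMEA", "orders": 150},
    {"region": "AMER", "orders": 260},
]

# Derived product: an executive dashboard feed, keyed by region.
leads_by_region = {}
for row in marketing:
    leads_by_region[row["region"]] = leads_by_region.get(row["region"], 0) + row["leads"]

dashboard_feed = []
for row in sales:
    leads = leads_by_region.get(row["region"], 0)
    dashboard_feed.append({
        "region": row["region"],
        "leads": leads,
        "orders": row["orders"],
        "conversion_rate": round(row["orders"] / leads, 3) if leads else None,
    })

print(dashboard_feed)
```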
Does this mean that creation of a data product is a formal process in which a business sponsor recruits a team to produce and distribute the data, with responsibility for its availability and quality? Sometimes yes but, more usually, no.
Creating Products "On The Fly"
Vishal Singh, Head of Data Products at Starburst, demonstrated how to use the Starburst Enterprise interface to create data products. He showed how to create a data product "on the fly" by setting up a new view of the data to meet a specific need of a consumer.
With Starburst, and with other data platforms that use virtualization, views that join selected data from different sources can be created quickly and easily. The data product that you need can often be prepared while you wait. You are the business sponsor, but you don't have to recruit a team to get your product.
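As a minimal sketch of what that can look like in practice, the following Python fragment uses the open-source Trino client to create such a view over two source catalogs. The host, catalog, schema, and table names are hypothetical, and the details will differ on a real Starburst deployment.

```python
# Minimal sketch: creating an "on the fly" data product as a view that joins
# data from two different source systems through a Trino/Starburst endpoint.
# Host, catalog, schema, and table names below are hypothetical.
import trino  # open-source Trino client: pip install trino

conn = trino.dbapi.connect(
    host="starburst.example.com",  # hypothetical coordinator host
    port=8080,
    user="analyst",
    catalog="hive",
    schema="analytics",
)
cur = conn.cursor()

# The view joins CRM data (one catalog/source) with order data (another),
# producing a consumer-facing product without copying or moving the data.
cur.execute("""
    CREATE OR REPLACE VIEW analytics.customer_order_summary AS
    SELECT c.customer_id,
           c.region,
           COUNT(o.order_id)  AS order_count,
           SUM(o.order_value) AS total_order_value
    FROM   crm.public.customers AS c
    JOIN   orders.sales.orders  AS o ON o.customer_id = c.customer_id
    GROUP  BY c.customer_id, c.region
""")

# Consumers can now query the view like any other table.
cur.execute("SELECT * FROM analytics.customer_order_summary LIMIT 10")
for row in cur.fetchall():
    print(row)
```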
If there is no responsible team, what about availability and quality? This is a critical question for a data mesh. Its answer depends on the level of co-ordination between the teams and individuals responsible for domain data products and derived data products.
Co-ordinating Data Products
For an enterprise data mesh to deliver meaningful results, co-ordination between data products is needed in a number of areas. They include (with quality last but definitely not least):
- Semantics. Integration of data from different products requires that they have common, or at least translatable, semantics. Max and Arif gave an example of the kind of semantic mismatch that often prevents successful integration: two data sets, both having a concept of "user", but with it referring to a long-lived record in one data set and to a transient session id in the other (a minimal reconciliation sketch follows this list).
- Access Control. Access to particular data elements is often restricted, for commercial confidentiality, personal data protection, national security, or other reasons. There must be an access control framework that enforces the appropriate restrictions, not just at the start of a data pipeline, but for all data products in the pipeline. And, as Starburst’s Director of Engineering Colleen Tartow pointed out, policy-based access control has a scaling problem: every piece of data carries access policies, and the number of policy decisions explodes when there are many data sources and many users. The access control framework must somehow accommodate this.
- Discoverability. Consumers must be able to find the data products they need easily; many enterprises have data catalogs for this purpose.
- Use Metrics. Common metrics are needed for comparability of data products across an enterprise. It should be possible to measure the use of a data product in terms of number of queries, CPU cycles, etc.
- Quality. Data quality can have many dimensions, and few data sets are 100% perfect in any of them. There must be a quality framework that ensures that quality is measured and kept as high as possible along the length of every data pipeline. This requires automation of quality checks. It could also include a concept of certification of data products. Barr Moses, CEO and Co-Founder of data reliability specialists Monte Carlo, gave an interesting presentation on Data Observability - an organization's ability to fully understand the health of the data in its systems. To enable this, each product should have data reliability service level indicators, objectives, and agreements (SLIs, SLOs, and SLAs), and data product definitions should include APIs to quality metrics (a sketch of what such an interface might look like also follows this list).
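Here is the reconciliation sketch promised above, in Python with invented data. It shows why two data sets that both have a "user" field cannot simply be joined on it when one means a long-lived account and the other a transient session id: an explicit session-to-account mapping has to be applied first.

```python
# Hypothetical sketch of a semantic mismatch on "user": in one data set
# "user" is a long-lived account record, in the other it is a transient
# session id. Joining on the shared name would silently produce nonsense;
# an explicit session-to-account mapping is needed first.

# Data set A: "user" means a long-lived account.
accounts = [
    {"user": "acct-001", "country": "DE"},
    {"user": "acct-002", "country": "FR"},
]

# Data set B: "user" means a transient session id.
sessions = [
    {"user": "sess-9f3", "page_views": 12},
    {"user": "sess-a71", "page_views": 3},
]

# The mapping that resolves the mismatch (in practice it might come from a login event feed).
session_to_account = {"sess-9f3": "acct-001", "sess-a71": "acct-002"}

# Reconcile: translate session-level "user" into account-level "user" before joining.
views_by_account = {}
for s in sessions:
    account_id = session_to_account.get(s["user"])
    if account_id is None:
        continue  # anonymous session: no long-lived user to attribute it to
    views_by_account[account_id] = views_by_account.get(account_id, 0) + s["page_views"]

for a in accounts:
    print(a["user"], a["country"], views_by_account.get(a["user"], 0))
```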
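And here is a sketch of what a data product definition with an API to quality metrics might look like. It is illustrative only: the metric names, objectives, and classes are invented, not taken from Monte Carlo or any other product.

```python
# Illustrative sketch only: a data product definition that exposes reliability
# SLIs programmatically, so consumers and pipelines can check quality before use.
# Metric names and thresholds are made up for the example.
from dataclasses import dataclass, field


@dataclass
class QualitySLI:
    name: str                      # e.g. "freshness_minutes", "null_rate_customer_id"
    value: float                   # latest measured value
    objective: float               # the SLO the product team commits to
    higher_is_better: bool = False

    def meets_objective(self) -> bool:
        if self.higher_is_better:
            return self.value >= self.objective
        return self.value <= self.objective


@dataclass
class DataProduct:
    name: str
    owner_team: str
    slis: list[QualitySLI] = field(default_factory=list)

    def quality_report(self) -> dict:
        """The 'API to quality metrics' that a product definition could expose."""
        return {
            "product": self.name,
            "owner": self.owner_team,
            "slis": {s.name: {"value": s.value, "objective": s.objective,
                              "ok": s.meets_objective()} for s in self.slis},
            "healthy": all(s.meets_objective() for s in self.slis),
        }


product = DataProduct(
    name="customer_order_summary",
    owner_team="sales-domain",
    slis=[
        QualitySLI("freshness_minutes", value=25.0, objective=60.0),
        QualitySLI("null_rate_customer_id", value=0.001, objective=0.01),
        QualitySLI("row_count_vs_expected", value=0.98, objective=0.95, higher_is_better=True),
    ],
)
print(product.quality_report())
```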
Data Governance
Let's return to the question of data governance. On the one hand, decentralized domain teams can work directly with the users and so are better placed to understand their needs. On the other, there must be co-ordination between product teams for an enterprise data mesh to deliver meaningful results. So can the decentralized teams really be independent?
This dilemma is not unique to data management. The need to balance regional autonomy and central control arises in most spheres of human activity (not least, politics). Enterprise and solution architects are very familiar with the tensions that can arise between a centrally-appointed enterprise architecture team and teams in divisions or departments.
The data mesh recommendation, put forward by Zhamak in her closing presentation, is for federated governance, with enablement groups serving as co-ordination committees between domain data product teams and platform teams. This may well be an effective approach, but each enterprise will probably form its own view. Decisions achieved by consensus are powerful because the affected parties are committed to them, but can take a long time to reach.
The Data Products Principle
A large enterprise is unlikely to have a pure data mesh architecture - or a pure data architecture of any kind. It will have a mix of data platforms. The State of Data survey conducted by Enterprise Management Associates and presented at the event showed that over 60% of respondents' ecosystems had from 4 to 9 different data platforms, with another 10% having even more than that. How is such ungainly data sprawl to be governed?
Enterprise architects are familiar with this kind of situation. The best way to deal with it is by an Architecture Principle. Principles are seldom adopted instantly or followed uniformly, but they are a powerful means of steering the enterprise in the right direction.
A good principle for data products might be:
- Every domain data set that is exposed outside its domain has a team that is responsible for its availability, quality, and adherence to corporate standards.
And that isn't a bad principle, even if you don't buy into the idea of a data mesh.
Key Takeaways
Finally, here are my key takeaways from the event.
You should not install a data mesh in order to improve data governance. A data mesh will bring its own set of governance issues, concerned with creating derived products and ensuring the integrity and quality of data in data pipelines.
You should install a data mesh in order to give users the data that they need, quickly and at scale. There will be governance issues but, if you have problems of scale and diversity of data (and what large enterprise doesn't), the effort of updating your governance regime will be well worthwhile.
Regardless of whether you install a data mesh, you should consider adopting the data products principle. Replacing a central control bottleneck with a regime that allows data product teams to operate independently is good practice - and a better style of governance.